Abstract: In this paper a stochastic generalisation of the standard Linde-Buzo-
Gray (LBG) approach to vector quantiser (VQ) design is presented, in which
the encoder is implemented as the sampling of a vector of code indices from
a probability distribution derived from the input vector, and the decoder is
implemented as a superposition of reconstruction vectors. This stochastic VQ 
(SVQ) is optimised using a minimum mean Euclidean reconstruction distortion 
criterion, as in the LBG case. Numerical simulations are used to demonstrate 
how this leads to self-organisation of the SVQ, where different stochastically 
sampled code indices become associated with different input subspaces. 

1 Introduction 

In vector quantisation a code book is used to encode each input vector as a 
corresponding code index, which is then decoded (again, using the codebook) 
to produce an approximate reconstruction of the original input vector [?]. The
purpose of this paper is to generalise the standard approach to vector quantiser
(VQ) design [?], so that each input vector is encoded as a vector of code indices
that are stochastically sampled from a probability distribution that depends on 
the input vector, rather than as a single code index that is the deterministic 
outcome of finding which entry in a code book is closest to the input vector. 
This will be called a stochastic VQ (SVQ), and it includes the standard VQ as 
a special case. Note that this approach is different from the various stochastic 
approaches that are used to train VQs (see e.g. [?]), because here the
codebook itself is stochastic, so the use of probability distributions is essential 
both during and after training. 

One advantage of using the stochastic approach, which will be demonstrated 
in this paper, is that it automates the process of splitting high-dimensional input 
vectors into low-dimensional blocks before encoding them, because minimising 
the mean Euclidean reconstruction error can encourage different stochastically
sampled code indices to become associated with different input subspaces [?].
Another advantage is that it is very easy to connect SVQs together, by using
the vector of code index probabilities computed by one SVQ as the input vector
to another SVQ [8].

In Section 2 various pieces of previously published theory are unified to give
a coherent account of SVQs. In Section 3 the results of some new numerical
simulations are presented, which demonstrate how the code indices in an SVQ
can become associated in various ways with input subspaces. In the appendices 
various derivations relating to the detailed training of an SVQ are presented. 

2 Theory 

In this section various pieces of previously published theory are unified to establish
a coherent framework for modelling SVQs. In Section 2.1 the basic theory
of folded Markov chains (FMC) is given [?], and in Section 2.2 it is extended to
the case of high-dimensional input data [?]. Finally, in Section 2.3 the theory
is further generalised to chains of linked FMCs [?].

2.1 Folded Markov Chains 

The basic building block of the encoder/decoder model used in this paper is 
the folded Markov chain (FMC) [?]. Thus an input vector x is encoded as a
code index vector y, which is then subsequently decoded as a reconstruction x'
of the input vector. Both the encoding and decoding operations are allowed to
be probabilistic, in the sense that y is a sample drawn from Pr(y|x), and x' is
a sample drawn from Pr(x'|y), where Pr(y|x) and Pr(x'|y) are Bayes' inverses
of each other, as given by $\Pr(x'|y) = \frac{\Pr(y|x')\,\Pr(x')}{\int dz\,\Pr(y|z)\,\Pr(z)}$, and Pr(x) is the
probability from which x was sampled. Because the chain of dependences in
passing from x to y and then to x' is first order Markov (i.e. it is described by
the directed graph x → y → x'), and because the two ends of this Markov
chain (i.e. x and x') live in the same vector space, it is called a folded Markov
chain (FMC). The operations that occur in an FMC are summarised in Figure 1.

In order to ensure that the FMC encodes the input vector optimally, a mea- 
sure of the reconstruction error must be minimised. There are many possible 
ways to define this measure, but one that is consistent with many previous 
results, and which also leads to many new results, is the mean Euclidean reconstruction
error measure D, which is defined as

$$D = \int dx\,\Pr(x) \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} \Pr(y|x) \int dx'\,\Pr(x'|y)\,\|x - x'\|^2 \tag{1}$$

where $y = (y_1, y_2, \cdots, y_n)$, $1 \le y_i \le M$ is assumed, Pr(x) Pr(y|x) Pr(x'|y) is the
joint probability that the FMC has state (x, y, x'), $\|x - x'\|^2$ is the Euclidean
reconstruction error, and $\int dx \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} \int dx'\,(\cdots)$ sums over all possible
states of the FMC (weighted by the joint probability).

Figure 1: A folded Markov chain (FMC) in which an input vector x is encoded
as a code index vector y that is drawn from a conditional probability Pr(y|x),
which is then decoded as a reconstruction vector x' drawn from the Bayes'
inverse conditional probability Pr(x'|y).

The Bayes' inverse probability Pr{x'\y) may be integrated out of this expres- 
sion for D to yield 

$$D = 2 \int dx\,\Pr(x) \sum_{y_1=1}^{M} \sum_{y_2=1}^{M} \cdots \sum_{y_n=1}^{M} \Pr(y|x)\,\|x - x'(y)\|^2 \tag{2}$$

where the reconstruction vector x'(y) is defined as $x'(y) = \int dx\,\Pr(x|y)\,x$. Because
of the quadratic form of the objective function, it turns out that x'(y)
may be treated as a free parameter whose optimum value (i.e. the solution of
$\partial D / \partial x'(y) = 0$) is $\int dx\,\Pr(x|y)\,x$, as required.

It was shown in [2] that the standard VQ [3] and topographic mappings [?]
automatically emerge as special cases when D is minimised. In this approach,
topographic mappings emerge as the optimal coding scheme when the code is
to be transmitted along a noisy communication channel before being decoded.

2.2 High Dimensional Input Spaces 

A problem with the standard VQ is that its code book grows exponentially 
in size as the dimensionality of the input vector is increased, assuming that
the contribution to the reconstruction error from each input dimension is held
constant. This means that such VQs are useless for encoding extremely high 
dimensional input vectors, such as images. The usual solution to this problem 
is to manually partition the input space into a number of lower dimensional 
subspaces, and then to encode each of these subspaces separately. However, it 
would be very useful if this partitioning could be done automatically, in such a 
way that typically the correlations within each subspace were much stronger than 
the correlations between subspaces, so that the subspaces were approximately 
statistically independent of each other. The purpose of this paper is to present 
a solution to this problem. 

The key step in solving this problem is to constrain the minimisation of D 
in such a way as to encourage the formation of code schemes in which each 
component of the code vector y codes a different subspace of the input vector
x. There are two related constraints that may be imposed on Pr(y|x) and x'(y),
which may be summarised as

$$\begin{aligned}
\Pr(y|x) &= \Pr(y_1|x)\,\Pr(y_2|x) \cdots \Pr(y_n|x) \\
x'(y) &= \frac{1}{n} \sum_{i=1}^{n} x'(y_i)
\end{aligned} \tag{3}$$

Thus each component $y_i$ (for $i = 1, 2, \cdots, n$ and $1 \le y_i \le M$) is an independent
sample drawn from the codebook using $\Pr(y_i|x)$ (which is assumed to be the
same function for all i), and the reconstruction vector x'(y) (vector argument)
is assumed to be a superposition of n contributions $x'(y_i)$ (scalar argument)
for $i = 1, 2, \cdots, n$. Taken together, these constraints encourage the formation
of coding schemes in which independent subspaces are separately coded, as 
required. 
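
The two constraints can be illustrated with a minimal sketch. It assumes that the superposition is the average of the n selected reconstruction vectors (a 1/n normalisation), and the names `svq_encode`, `svq_decode`, `codebook` and `posterior` are hypothetical, as are all of the numerical values.

```python
import numpy as np

def svq_encode(posterior, n, rng):
    """Draw n independent code indices (0-based) from Pr(y|x)."""
    return rng.choice(len(posterior), size=n, p=posterior)

def svq_decode(indices, codebook):
    """Average the reconstruction vectors x'(y_i) picked out by the sampled indices."""
    return codebook[indices].mean(axis=0)

rng = np.random.default_rng(0)
M, n, dim = 4, 1000, 3                       # illustrative sizes only
codebook = rng.normal(size=(M, dim))         # hypothetical x'(y), one row per code index
posterior = np.array([0.5, 0.5, 0.0, 0.0])   # hypothetical Pr(y|x) for a single input x

y = svq_encode(posterior, n, rng)
x_rec = svq_decode(y, codebook)
x_mean = posterior @ codebook                # what x_rec approaches as n grows
```

With a large n the sampled reconstruction `x_rec` settles close to the probability-weighted mean `x_mean`, which is why the second term of the bound on D involves that mean.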

The constraints in Equation 3 prevent the full space of possible values of
Pr(y|x) or x'(y) from being explored as D is minimised, so they lead to an
upper bound $D_1 + D_2$ on the FMC objective function D (i.e. $D \le D_1 + D_2$),
which may be derived as [?]

$$\begin{aligned}
D_1 &= \frac{2}{n} \int dx\,\Pr(x) \sum_{y=1}^{M} \Pr(y|x)\,\|x - x'(y)\|^2 \\
D_2 &= \frac{2(n-1)}{n} \int dx\,\Pr(x)\,\Big\|x - \sum_{y=1}^{M} \Pr(y|x)\,x'(y)\Big\|^2
\end{aligned} \tag{4}$$

Note that M (size of codebook) and n (number of samples drawn from codebook
using Pr(y|x)) are effectively model order parameters, whose values need to be
chosen appropriately for each encoder optimisation problem. The properties of
the optimum solution depend critically on the interplay between the statistical
properties of the training data and the model order parameters M and n, as
will be seen in the simulations in Section 3.
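
The split into two terms can be checked numerically for a single input vector. The sketch below assumes that $D_1$ carries the prefactor 2/n and $D_2$ the prefactor 2(n-1)/n, and that the reconstruction is the average of the n sampled vectors. Under those assumptions, for fixed Pr(y|x) and fixed x'(y), twice the expected squared error of the averaged reconstruction equals the sum of the two per-input terms. All function names and values are hypothetical.

```python
import itertools
import numpy as np

def expected_distortion_exact(x, posterior, codebook, n):
    """2 * E||x - (1/n) sum_i x'(y_i)||^2, enumerating all M**n index vectors."""
    M = len(posterior)
    total = 0.0
    for ys in itertools.product(range(M), repeat=n):
        idx = list(ys)
        prob = np.prod(posterior[idx])          # independent samples: product of Pr(y_i|x)
        recon = codebook[idx].mean(axis=0)      # averaged reconstruction
        total += prob * np.sum((x - recon) ** 2)
    return 2.0 * total

def d1_d2_integrands(x, posterior, codebook, n):
    """Per-input contributions to D1 and D2 under the assumed prefactors."""
    sq_err = np.sum((x - codebook) ** 2, axis=1)            # ||x - x'(y)||^2 for each y
    d1 = (2.0 / n) * (posterior @ sq_err)
    mean_recon = posterior @ codebook                        # sum_y Pr(y|x) x'(y)
    d2 = (2.0 * (n - 1) / n) * np.sum((x - mean_recon) ** 2)
    return d1, d2

rng = np.random.default_rng(1)
M, n, dim = 3, 4, 2                                          # small enough to enumerate
x = rng.normal(size=dim)
codebook = rng.normal(size=(M, dim))
posterior = rng.dirichlet(np.ones(M))
exact = expected_distortion_exact(x, posterior, codebook, n)
d1, d2 = d1_d2_integrands(x, posterior, codebook, n)
```

The agreement of `exact` with `d1 + d2` suggests that the inequality in the bound comes from restricting x'(y) to the superposition form, rather than from the averaging over samples itself.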
Figure 2: A chain of linked FMCs, in which the output from each stage is its
vector of posterior probabilities (for all values of the code index), which is then
used as the input to the next stage. Only 3 stages are shown, but any number
may be used. More generally, any acyclically linked network of FMCs may be
used.



2.3 Chains of Linked FMCs 

The FMC illustrated in Figure 1 may be generalised to a chain of linked FMCs
as shown in Figure 2. Each stage in this chain is an FMC of the type shown
in Figure 1, and the vector of probabilities (for all values of the code index)
computed by each stage is used as the input vector to the next stage; there
are other ways of linking the stages together, but this is the simplest possibility.
The overall objective function is a weighted sum of the FMC objective functions
derived from each stage. The total number of free parameters in an L stage chain
is 3L − 1, which is the sum of 2 free parameters for each of the L stages, plus
L − 1 weighting coefficients; there are L − 1 rather than L weighting coefficients
because the overall normalisation of the objective function does not affect the
optimum solution.

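The chain of linked FMCs may be expressed mathematically by first of all
introducing an index l to allow different stages of the chain to be distinguished
thus

$$x \to x^{(l)}, \quad y \to y^{(l)}, \quad x' \to x'^{(l)}, \quad M \to M^{(l)}, \quad n \to n^{(l)}, \quad D \to D^{(l)}, \quad D_1 \to D_1^{(l)}, \quad D_2 \to D_2^{(l)} \tag{5}$$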



The stages are then defined and linked together thus 

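$$\begin{aligned}
x^{(l+1)} &= \big(x_1^{(l+1)}, x_2^{(l+1)}, \cdots, x_{M^{(l)}}^{(l+1)}\big) \\
x_i^{(l+1)} &= \Pr\big(y^{(l)} = i \,\big|\, x^{(l)}\big), \quad 1 \le i \le M^{(l)}
\end{aligned} \tag{6}$$
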
The objective function and its upper bound are then given by

$$D = \sum_{l=1}^{L} s^{(l)} D^{(l)} \le D_1 + D_2 = \sum_{l=1}^{L} s^{(l)} \big(D_1^{(l)} + D_2^{(l)}\big) \tag{7}$$

where $s^{(l)} \ge 0$ is the weighting that is applied to the contribution of stage l of
the chain to the overall objective function.
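
A minimal sketch of how the stages might be linked and their contributions weighted is given below, for a single input vector. It assumes the normalised sigmoid form of Pr(y|x) from Appendix A and the prefactors 2/n and 2(n-1)/n for the two terms of the bound. The function names and the random parameter values are purely illustrative.

```python
import numpy as np

def posterior(x, w, b):
    """Pr(y|x) as normalised sigmoid responses (the form used in Appendix A)."""
    q = 1.0 / (1.0 + np.exp(-(w @ x + b)))
    return q / q.sum()

def stage_bound(x, p, codebook, n):
    """Per-input D1 + D2 for one stage, under the assumed prefactors."""
    sq_err = np.sum((x - codebook) ** 2, axis=1)
    d1 = (2.0 / n) * (p @ sq_err)
    d2 = (2.0 * (n - 1) / n) * np.sum((x - p @ codebook) ** 2)
    return d1 + d2

def chain_bound(x, stages, weights):
    """Weighted sum of the per-stage bounds along a chain of linked stages."""
    total = 0.0
    for (w, b, codebook, n), s in zip(stages, weights):
        p = posterior(x, w, b)
        total += s * stage_bound(x, p, codebook, n)
        x = p   # the vector of code index probabilities is the next stage's input
    return total

rng = np.random.default_rng(2)
sizes = [24, 16, 16]   # input dimension, then M for stage 1 and stage 2
stages = [(rng.normal(size=(M, d_in)), rng.normal(size=M),
           rng.normal(size=(M, d_in)), 20)
          for d_in, M in zip(sizes[:-1], sizes[1:])]
x_in = rng.random(24)
total = chain_bound(x_in, stages, weights=(1.0, 1.0))
first_only = chain_bound(x_in, stages, weights=(1.0, 0.0))
```

Setting the second weight to zero recovers the single-stage bound, so the second stage can only add to the total.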



3 Simulations 

In this section the results of various simulations are presented, which demon- 
strate some of the types of self-organising behaviour exhibited by an encoder 
that consists of a chain of linked FMCs. Synthetic, rather than real, training 
data are used in all of the simulations, because this allows the basic types of 
behaviour to be cleanly demonstrated. 

In Section 3.1 the training data is described. In Section 3.2 a single stage
encoder is trained on data that is a superposition of two randomly positioned
objects. In Section 3.3 this is generalised to objects with correlated positions,
and three different types of behaviour are demonstrated: factorial encoding
using both a 1-stage and a 2-stage encoder (Section 3.4), joint encoding using
a 1-stage encoder (Section 3.5), and invariant encoding using a 2-stage encoder
(Section 3.6).

In Appendix A the derivatives of the objective function are derived, and in
Appendix B a gradient descent training algorithm based on these derivatives is
presented.

3.1 Training Data 

The key property that this type of self-organising encoder exhibits is its ability 
to automatically split up high-dimensional input spaces into lower-dimensional 
subspaces, each of which is separately encoded. This self-organisation manifests 
itself in many different ways, depending on the interplay between the statistical 
properties of the training data, and the 3 free parameters (i.e. the code book 
size M, the number of code indices sampled n, and the stage weighting s) per 
stage of the encoder (see Section 2.3).

In order to demonstrate the various different basic types of self-organisation 
it is necessary to use synthetic training data with controlled properties. All of 
the types of self-organisation that will be demonstrated in this paper may be 
obtained by training a 1-stage or 2-stage encoder on 24-dimensional data (i.e. 
M = 24) that consists of a superposition of a pair of identical objects (with 
circular wraparound to remove edge effects), such as is shown in Figure 3.

Figure 3: An example of a typical training vector for M = 24. Each object is
a Gaussian hump with a half-width of 1.5 units, and peak amplitude of 1. The
overall input vector is formed as a linear superposition of the 2 objects. Note
that the input vector is wrapped around circularly to remove minor edge effects
that would otherwise arise.

In the simulations presented below, two different methods of selecting the
object positions are used: either the positions are statistically independent, or
they are correlated. In the independent case, each object position is a random
integer in the interval [1, 24]. In the correlated case, the first object position is
a random integer in the interval [1, 24], and the second object position is chosen
relative to the first one as an integer in the range [4, 8], so that the mean object
separation is 6 units.
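
Training vectors of this kind could be generated along the following lines. This is a sketch under several assumptions that the text does not settle: the "half-width" is taken to be the Gaussian standard deviation, positions are 0-based stand-ins for the interval [1, 24], and in the correlated case the second object is placed at a forward offset from the first. The function names are hypothetical.

```python
import numpy as np

def gaussian_hump(centre, dim=24, half_width=1.5):
    """Unit-amplitude Gaussian hump on a ring of `dim` sites."""
    sites = np.arange(dim)
    # Circular distance, so a hump near one end wraps around to the other end.
    dist = np.abs(sites - centre)
    dist = np.minimum(dist, dim - dist)
    # Assumption: half_width plays the role of the standard deviation.
    return np.exp(-0.5 * (dist / half_width) ** 2)

def training_vector(rng, dim=24, correlated=False):
    """Superposition of two humps, returned with the two object positions."""
    first = int(rng.integers(0, dim))
    if correlated:
        # Offset drawn uniformly from {4, 5, 6, 7, 8}, giving a mean separation of 6.
        second = (first + int(rng.integers(4, 9))) % dim
    else:
        second = int(rng.integers(0, dim))
    x = gaussian_hump(first, dim) + gaussian_hump(second, dim)
    return x, (first, second)

rng = np.random.default_rng(3)
x, positions = training_vector(rng, correlated=True)
```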

3.2 Independent Objects 

The simplest demonstration is to let a single stage encoder discover the fact 
that the training data consists of a superposition of a pair of objects, which is a 
type of independent component analysis (ICA) [23]. This may readily be done
by setting the parameter values as follows: code book size M = 16, number of
code indices sampled n = 20, ε = 0.2 for 500 training steps, ε = 0.1 for a further
500 training steps.

The self-organisation of each of the 16 reconstruction vectors as training
progresses (measured down the page) is shown in Figure 4. After some initial
confusion, the reconstruction vectors self-organise so that each code index corre- 
sponds to a single object at a well defined location, whose width automatically 
adjusts itself so that the M reconstruction vectors cover the whole input space. 
Figure 4: A factorial encoder emerges when a single stage encoder is trained on
data that is a superposition of 2 objects in independent locations. 

This behaviour is non-trivial, because each training vector is a superposition of 
a pair of objects at independent locations, so two different code index values 
must be sampled by the encoder (assuming that the two objects are not at the 
same location); the relatively large choice n = 20 ensures that it is highly likely
that both code index values will be amongst the n random samples [?]. This
result is called a factorial encoder, because the objects are encoded separately. 
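
The effect of n on the chance of capturing both objects can be made concrete with a small calculation. It uses an idealisation in which all of the probability sits on just two code indices (one per object); the function name is hypothetical.

```python
def prob_both_sampled(n, p_first=0.5):
    """Probability that n independent draws include both of two code indices.

    Idealisation: the two indices carry probabilities p_first and 1 - p_first,
    and every other code index has probability zero.
    """
    p_second = 1.0 - p_first
    # Complement of "every draw hit the first index" or "every draw hit the second".
    return 1.0 - p_first ** n - p_second ** n

p_large = prob_both_sampled(20)   # n = 20
p_small = prob_both_sampled(3)    # n = 3, which gives 0.75
```

Under this idealisation the failure probability for n = 20 is 2^(-19), whereas for n = 3 one draw in four would miss one of the two objects.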

The case of a joint encoder, where each code index corresponds to a pair 
of objects at well defined locations, requires a rather large code book when the 
objects are independent. However, when correlations between the objects are 
introduced then the code book can be reduced to a manageable size, as will be 
demonstrated in the next section. 
3.3 Correlated Objects



If the positions of the pair of objects are mutually correlated, then they can be 
encoded in 3 fundamentally different ways: 

1. Factorial encoder. This encoder ignores the correlations between the ob- 
jects, and encodes them as if they were 2 independent objects. Each code 
index thus encodes a single object position, so many code indices must be 
sampled in order to virtually guarantee that both object positions are encoded
[2]. This result is a type of independent component analysis (ICA).

2. Joint encoder. This encoder regards each possible joint placement of the 
2 objects as a distinct configuration. Each code index thus encodes a pair 
of object positions, so only one code index needs to be sampled in order 
to guarantee that both object positions are encoded [?]. This result is
basically the same as what would be obtained by using a standard VQ [?].

3. Invariant encoder. This encoder regards each possible placement of the 
centroid of the 2 objects as a distinct configuration, but regards all possible 
object separations (for a given centroid) as being equivalent. Each code 
index thus encodes only the centroid of the pair of objects. This type of 
encoder does not arise when the objects are independent. This is similar 
to self-organising transformation invariant detectors described in [2].




Figure 5: Three alternative ways of using 30 code indices to encode a pair of
correlated variables (panels: factorial, joint, invariant). The typical code cells
are shown in bold.

Each of these 3 possibilities is shown in Figure 5, where the diagrams are
meant only to be illustrative. The correlated variables live in the large
2-dimensional rectangular region extending from bottom-left to top-right of each
diagram. 

The factorial encoder has two orthogonal sets of long thin rectangular code 
cells, and the diagram shows how a pair of such cells intersect to define a small 
square code cell. The joint encoder behaves as a standard vector quantiser, and 
is illustrated as having a set of square code cells, although their shapes will not 
be as simple as this in practice. The invariant encoder ideally has a set of long
thin rectangular code cells that encode only the long diagonal dimension. 

In all 3 cases there is overlap between code cells. In the case of the factorial 
and joint encoders the overlap tends to be only between nearby code cells, 
whereas in the case of an invariant encoder the range of the overlap is usually 
much greater, as will be seen in the numerical simulations below. In practice 
the optimum encoder may not be a clean example of one of the types illustrated 
in Figure 5, as will also be seen in the numerical simulations below.

3.4 Factorial Encoding 

A factorial encoder may be trained by setting the parameter values as follows: 
code book size M = 16, number of code indices sampled n = 20, ε = 0.2 for 500
training steps, ε = 0.1 for a further 500 training steps.




Figure 6: A factorial encoder emerges when a single stage encoder is trained on 
data that is a superposition of 2 objects in correlated locations. 
The result is shown in Figure 6, which should be compared with the result
for independent objects in Figure 4. The presence of correlations degrades the
quality of this factorial code relative to the case of independent objects. The 
contamination of the factorial code takes the form of a few code indices which 
respond jointly to the pair of objects. 

The joint coding contamination of the factorial code can be reduced by using 
a 2-stage encoder, in which the second stage has the same values of M and n 
as the first stage (although identical parameter values are not necessary), and 
(in this case) both stages have the same weighting in the objective function (see 
Equation 7).




Figure 7: The factorial encoder is improved, by the removal of the joint encoding 
contamination, when a 2-stage encoder is used. 

The results are shown in Figure 7. The reason that the second stage encourages
the first to adopt a pure factorial code is quite subtle. The result shown in
Figure 7 will lead to the first stage producing an output in which 2 code indices
(one for each object) each typically have probability 1/2 of being sampled, and all
of the remaining code indices have a very small probability (this is an approximation
which ignores the fact that the code cells overlap). On the other hand,
Figure 6 will lead to an output in which the probability can be concentrated
on a single code index, if it can jointly code the pair of objects. However, the
contribution of the second stage to the overall objective function encourages it
to encode the vector of probabilities output by the first stage with minimum
Euclidean reconstruction error, which is easier to do if the situation is as in
Figure 7 rather than as in Figure 6. In effect, the second stage likes to see an
output from the first stage in which more than one code index has a significant 
probability of being sampled, which favours factorial coding over joint encoding. 

3.5 Joint Encoding 

A joint encoder may be trained by setting the parameter values as follows: code 
book size M = 16, number of code indices sampled n = 3, ε = 0.2 for 500
training steps, ε = 0.1 for a further 500 training steps, ε = 0.05 for a further
1000 training steps. This is the same as the parameter values for the factorial 
encoder above, except that n has been reduced to n = 3, and the training 
schedule has been extended. 

The result is shown in Figure 8. After some initial confusion, the reconstruction
vectors self-organise so that each code index corresponds to a pair of
objects at well defined locations, so the code index jointly encodes the pair of
object positions; this is a joint encoder. The small value of n prevents a factorial
encoder from emerging [?].

3.6 Invariant Encoding 

An invariant encoder may be trained by using a 2-stage encoder, and setting 
the parameter values identically in each stage as follows (where the weighting 
of the second stage relative to the first is denoted as s): code book size M = 16, 
number of code indices sampled n = 3, ε = 0.2 and s = 5 for 500 training steps,
ε = 0.1 and s = 10 for a further 500 training steps, ε = 0.05 and s = 20 for a
further 500 training steps, ε = 0.05 and s = 40 for a further 500 training steps.
This is basically the same as the parameter values used for the joint encoder
above, except that there are now 2 stages, and the weighting of the second stage
is progressively increased throughout the training schedule. Note that the large
value that is used for s is offset to a certain extent by the fact that the ratio
of the normalisation of the inputs to the first and second stages is very large; 
the anomalous normalisation of the input to the first stage could be removed by 
insisting that the input to the first stage is a vector of probabilities, but that is 
not done in these simulations. 

The result is shown in Figure 9. During the early part of the training schedule
the weighting of the second stage is still relatively small, so it has the effect
of turning what would otherwise have been a joint encoder into a factorial
encoder; this is analogous to the effect observed when Figure 6 becomes Figure 7.

Figure 8: A joint encoder emerges when a single stage encoder is trained on
data that is a superposition of 2 objects in correlated locations. 

However, as the training schedule progresses the weighting of the second 
stage increases further, and the reconstruction vectors self-organise so that each 
code index corresponds to a pair of objects with a well defined centroid but 
indeterminate separation. Thus each code index encodes only the centroid of 
the pair of objects and ignores their separation. This is a new type of encoder 
that arises when the objects are correlated, and it will be called an invariant 
encoder, in recognition of the fact that its output is invariant with respect to 
the separation of the objects. 

Note that in these results there is a large amount of overlap between the code
cells, which should be taken into account when interpreting the illustration in
Figure 5. This is an extreme example of the second stage preferring an output from
the first stage in which more than one code index has a significant probability
of being sampled; the large amount of overlap between code cells means that
many code indices have a significant probability of being sampled.

Figure 9: An invariant encoder emerges when a 2-stage encoder is trained on data
that is a superposition of 2 objects in correlated locations.

4 Conclusions 

The numerical results presented in this paper show that a stochastic vector 
quantiser (SVQ) can self-organise to find a variety of different ways of
encoding high-dimensional input vectors. Three fundamentally different types 
of encoder have been demonstrated, which differ in the way that they build a 
reconstruction that approximates the input vector: 

1. A factorial encoder uses a reconstruction that is a superposition of a number
of vectors that each live in a well defined input subspace, which is useful for
discovering constituent objects in the input vector. This result is a type of
independent component analysis (ICA) [14].

2. A joint encoder uses a reconstruction that is a single vector that lives
in the whole input space. This result is basically the same as what would be
obtained by using a standard VQ [?].

3. An invariant encoder uses a reconstruction that is a single vector that
lives in a subspace of the whole input space, so it ignores some dimensions of the
input vector, which is useful for discovering correlated objects whilst
rejecting uninteresting fluctuations in their relative coordinates. This is similar
to self-organising transformation invariant detectors described in [?].

More generally, the encoder will be a hybrid of these basic types, depending 
on the interplay between the statistical properties of the input vector and the 
parameter settings of the SVQ. 



A Derivatives of the Objective Function 

In order to minimise $D_1 + D_2$ it is necessary to compute its derivatives. The
derivatives were presented in detail in [?] for a single stage chain (i.e. a single
FMC). The purpose of this appendix is to extend this derivation to a multi-stage
chain of linked FMCs. In order to write the various expressions compactly,
infinitesimal variations will be used throughout this appendix, so that $\delta(uv) = \delta u\,v + u\,\delta v$
will be written rather than $\frac{\partial (uv)}{\partial\theta} = \frac{\partial u}{\partial\theta} v + u \frac{\partial v}{\partial\theta}$ (for some parameter $\theta$).
The calculation will be done in a top-down fashion, differentiating the objective
function first, then differentiating anything that the objective function depends
on, and so on following the dependencies down until only constants are left (this
is essentially the chain rule of differentiation).

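The derivative of $D_1 + D_2$ (defined in Equation 7) is given by

$$\delta \sum_{l=1}^{L} s^{(l)} \big(D_1^{(l)} + D_2^{(l)}\big) = \sum_{l=1}^{L} s^{(l)} \big(\delta D_1^{(l)} + \delta D_2^{(l)}\big) \tag{8}$$

The derivatives of the $D_1^{(l)}$ and $D_2^{(l)}$ parts (defined in Equation 4, with appropriate
(l) superscripts added) of the contribution of stage l to $D_1 + D_2$ are given
by (dropping the (l) superscripts again, for clarity)

$$\begin{aligned}
\delta D_1 &= \frac{2}{n} \int dx\,\Pr(x) \sum_{y=1}^{M} \Big( \delta\Pr(y|x)\,\|x - x'(y)\|^2 + 2\Pr(y|x)\,\big(\delta x - \delta x'(y)\big) \cdot \big(x - x'(y)\big) \Big) \\
\delta D_2 &= \frac{4(n-1)}{n} \int dx\,\Pr(x) \Big( \delta x - \sum_{y=1}^{M} \big( \delta\Pr(y|x)\,x'(y) + \Pr(y|x)\,\delta x'(y) \big) \Big) \cdot \Big( x - \sum_{y'=1}^{M} \Pr(y'|x)\,x'(y') \Big)
\end{aligned} \tag{9}$$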
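
As a partial check on these variations, the sketch below keeps only the terms in the variation of x'(y) and compares the resulting gradient with a finite difference. It assumes that $D_1$ and $D_2$ carry the prefactors 2/n and 2(n-1)/n, and it replaces the integral over Pr(x) with an average over a small batch. All names and values are hypothetical.

```python
import numpy as np

def bound(xs, ps, codebook, n):
    """D1 + D2 averaged over a batch; rows of xs are inputs, rows of ps are Pr(y|x)."""
    sq_err = ((xs[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)   # shape (N, M)
    d1 = (2.0 / n) * (ps * sq_err).sum(axis=1).mean()
    resid = xs - ps @ codebook                                            # shape (N, dim)
    d2 = (2.0 * (n - 1) / n) * (resid ** 2).sum(axis=1).mean()
    return d1 + d2

def grad_codebook(xs, ps, codebook, n):
    """Gradient of D1 + D2 with respect to each x'(y), posteriors held fixed."""
    N = xs.shape[0]
    resid = xs - ps @ codebook
    # D1 term: -(4/n) <Pr(y|x) (x - x'(y))>
    g1 = -(4.0 / n) * (ps.T @ xs - ps.sum(axis=0)[:, None] * codebook) / N
    # D2 term: -(4(n-1)/n) <Pr(y|x) (x - sum_y' Pr(y'|x) x'(y'))>
    g2 = -(4.0 * (n - 1) / n) * (ps.T @ resid) / N
    return g1 + g2

rng = np.random.default_rng(4)
N, M, dim, n = 5, 3, 4, 6
xs = rng.normal(size=(N, dim))
ps = rng.dirichlet(np.ones(M), size=N)
codebook = rng.normal(size=(M, dim))
g = grad_codebook(xs, ps, codebook, n)

# Central finite difference on one component of one reconstruction vector.
step = 1e-5
bump = np.zeros_like(codebook)
bump[1, 2] = 1.0
fd = (bound(xs, ps, codebook + step * bump, n)
      - bound(xs, ps, codebook - step * bump, n)) / (2 * step)
```

Because the bound is quadratic in x'(y), the central difference `fd` should match `g[1, 2]` up to rounding error.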

The first step in modelling Pr(y|x) is to explicitly state the fact that it is a
probability, which is a non-negative normalised quantity. This may be done as
follows

$$\Pr(y|x) = \frac{Q(y|x)}{\sum_{y'=1}^{M} Q(y'|x)} \tag{10}$$

where $Q(y|x) \ge 0$. The Q(y|x) are unnormalised probabilities, and $\sum_{y'=1}^{M} Q(y'|x)$
is the normalisation factor. The derivative of Pr(y|x) is given by

$$\frac{\delta\Pr(y|x)}{\Pr(y|x)} = \frac{1}{Q(y|x)} \Big( \delta Q(y|x) - \Pr(y|x) \sum_{y'=1}^{M} \delta Q(y'|x) \Big) \tag{11}$$



The second step in modelling Pr(y|x) is to introduce an explicit parametric
form for Q(y|x). The following sigmoidal function will be used in this paper

$$Q(y|x) = \frac{1}{1 + \exp\big(-w(y) \cdot x - b(y)\big)} \tag{12}$$

where w(y) is a weight vector and b(y) is a bias. The derivative of Q(y|x) is
given by

$$\delta Q(y|x) = Q(y|x)\,\big(1 - Q(y|x)\big)\,\big(\delta w(y) \cdot x + w(y) \cdot \delta x + \delta b(y)\big) \tag{13}$$
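
The sigmoidal model and its first-order variation can be sketched as follows, assuming Q(y|x) = 1/(1 + exp(-w(y).x - b(y))) normalised over y, with the input x held fixed. The function names are hypothetical; the comparison against a small finite perturbation is only a sanity check on the algebra.

```python
import numpy as np

def q_unnorm(x, w, b):
    """Unnormalised sigmoidal responses Q(y|x), one per code index."""
    return 1.0 / (1.0 + np.exp(-(w @ x + b)))

def posterior(x, w, b):
    """Pr(y|x) obtained by normalising Q(y|x) over the code index y."""
    q = q_unnorm(x, w, b)
    return q / q.sum()

def delta_posterior(x, w, b, dw, db):
    """First-order change in Pr(y|x) for small changes dw and db (x held fixed)."""
    q = q_unnorm(x, w, b)
    p = q / q.sum()
    dq = q * (1.0 - q) * (dw @ x + db)        # variation of the sigmoid, with delta x = 0
    return (p / q) * (dq - p * dq.sum())      # variation of the normalised probability

rng = np.random.default_rng(5)
M, dim = 4, 6
x = rng.random(dim)
w, b = rng.normal(size=(M, dim)), rng.normal(size=M)
dw, db = rng.normal(size=(M, dim)), rng.normal(size=M)

step = 1e-6
numeric = (posterior(x, w + step * dw, b + step * db) - posterior(x, w, b)) / step
analytic = delta_posterior(x, w, b, dw, db)
```

Since the probabilities sum to one, their first-order changes should sum to zero, which gives a second quick check on `analytic`.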

This has reduced the $\delta D_1$ and $\delta D_2$ derivatives to $\delta w(y)$, $\delta b(y)$, $\delta x'(y)$ and
$\delta x$ derivatives. The $\delta w(y)$, $\delta b(y)$ and $\delta x'(y)$ derivatives relate directly to the
parameters being optimised and thus need no further simplification; however,
the $\delta x$ derivatives in Equation 9 and Equation 13 need some further attention.
The $\delta x$ derivative arises only in multi-stage chains of FMCs, and because of
the way in which stages of the chain are linked together (see Equation 6) it
is equal to the derivative of the vector of probabilities output by the previous
stage. Thus the $\delta x$ derivative may be obtained by following its dependencies
back through the stages of the chain until the first layer is reached; this is
essentially the chain rule of differentiation. This ensures that for each stage the
partial derivatives include the additional contributions that arise from forward
propagation through later stages, as described in Appendix B.



B Training Algorithm 

Assuming that Pr(y|x) is modelled as in Appendix A (i.e. $\Pr(y|x) = Q(y|x) / \sum_{y'=1}^{M} Q(y'|x)$
and $Q(y|x) = 1 / \big(1 + \exp(-w(y) \cdot x - b(y))\big)$), then the partial derivatives of $D_1 + D_2$ with
respect to the 3 types of parameters in a single stage of the encoder may be
denoted as

$$\begin{aligned}
g_w(y) &\equiv \frac{\partial (D_1 + D_2)}{\partial w(y)} \\
g_b(y) &\equiv \frac{\partial (D_1 + D_2)}{\partial b(y)} \\
g_x(y) &\equiv \frac{\partial (D_1 + D_2)}{\partial x'(y)}
\end{aligned} \tag{14}$$

This may be generalised to each stage of a multi-stage encoder by including an
(l) superscript, and ensuring that for each stage the partial derivatives include
the additional contributions that arise from forward propagation through later
stages; this is essentially an application of the chain rule of differentiation, using
the derivatives of $x^{(l+1)}$ with respect to $w^{(l)}(y^{(l)})$ and $b^{(l)}(y^{(l)})$ to link the
stages together (see Appendix A).

A simple algorithm for updating these parameters is (omitting the (l) superscript,
for clarity)

$$\begin{aligned}
w(y) &\to w(y) - \varepsilon \frac{g_w(y)}{g_{w,0}} \\
b(y) &\to b(y) - \varepsilon \frac{g_b(y)}{g_{b,0}} \\
x'(y) &\to x'(y) - \varepsilon \frac{g_x(y)}{g_{x,0}}
\end{aligned} \tag{15}$$

where ε is a small update step size parameter, and the three normalisation
factors are defined as

$$\begin{aligned}
g_{w,0} &\equiv \max_y \sqrt{\frac{\|g_w(y)\|^2}{\dim x}} \\
g_{b,0} &\equiv \max_y |g_b(y)| \\
g_{x,0} &\equiv \max_y \sqrt{\frac{\|g_x(y)\|^2}{\dim x}}
\end{aligned} \tag{16}$$

The $g_{w,0}$ and $g_{x,0}$ factors ensure that the maximum update step size for
w(y) and x'(y) is ε dim x (i.e. ε per dimension), and the $g_{b,0}$ factor ensures
that the maximum update step size for b(y) is ε. When a stationary point
of $D_1 + D_2$ is reached, the finite size of ε prevents the parameter values from
converging to a perfectly stationary solution, and instead they jump around in
its neighbourhood.
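
One way to realise this normalised update is sketched below. It assumes that the gradients have already been computed and are stored with one row per code index, and that the weight vectors and reconstruction vectors share the input dimension. The function name and the small constant `tiny` (a guard against division by zero) are not part of the algorithm as described.

```python
import numpy as np

def normalised_update(w, b, codebook, g_w, g_b, g_x, eps):
    """One gradient step in which each parameter type is scaled by its largest gradient."""
    dim_x = w.shape[1]
    g_w0 = np.sqrt((g_w ** 2).sum(axis=1) / dim_x).max()   # largest RMS weight gradient
    g_b0 = np.abs(g_b).max()                               # largest bias gradient
    g_x0 = np.sqrt((g_x ** 2).sum(axis=1) / dim_x).max()   # largest RMS reconstruction gradient
    tiny = 1e-12   # hypothetical safeguard for an all-zero gradient
    return (w - eps * g_w / (g_w0 + tiny),
            b - eps * g_b / (g_b0 + tiny),
            codebook - eps * g_x / (g_x0 + tiny))

rng = np.random.default_rng(6)
M, dim, eps = 16, 24, 0.1
w, b, codebook = rng.normal(size=(M, dim)), rng.normal(size=M), rng.normal(size=(M, dim))
g_w, g_b, g_x = rng.normal(size=(M, dim)), rng.normal(size=M), rng.normal(size=(M, dim))
w_new, b_new, codebook_new = normalised_update(w, b, codebook, g_w, g_b, g_x, eps)

# Largest root-mean-square change per component across the weight vectors.
rms_w = np.sqrt(((w_new - w) ** 2).mean(axis=1)).max()
```

With this scaling the largest root-mean-square change per component of any weight or reconstruction vector is ε, and the largest change in any bias is ε, regardless of the raw gradient magnitudes.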

This update algorithm can be generalised to use a different ε for each stage
of the encoder, and also to allow a different ε to be used for each of the 3 types of
parameter. Furthermore, the size of ε can be varied as training proceeds, usually
starting with a large value, and then gradually reducing its size to obtain an
accurate estimate of the stationary solution. It is not possible to give general
rules for exactly how to do this, because training conditions depend very much
on the statistical properties of the training set.